library(tidyverse)
library(readxl)
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
Young Soo Choi
August 16, 2022
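Today's challenge is to read in a data set and describe it, and to provide summary statistics for interesting groups within the data and interpret them.
I chose the file 'StateCounty2012.xls' and read it in. Its first 3 rows hold a title and other non-data text, so I skipped them with the skip argument.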
# A tibble: 2,990 × 5
STATE ...2 COUNTY ...4 TOTAL
<chr> <lgl> <chr> <lgl> <dbl>
1 AE NA APO NA 2
2 AE Total1 NA <NA> NA 2
3 AK NA ANCHORAGE NA 7
4 AK NA FAIRBANKS NORTH STAR NA 2
5 AK NA JUNEAU NA 3
6 AK NA MATANUSKA-SUSITNA NA 2
7 AK NA SITKA NA 1
8 AK NA SKAGWAY MUNICIPALITY NA 88
9 AK Total NA <NA> NA 103
10 AL NA AUTAUGA NA 102
# … with 2,980 more rows
# ℹ Use `print(n = ...)` to see more rows
[1] "STATE" "...2" "COUNTY" "...4" "TOTAL"
This dataset has 2,990 rows and 5 columns. The column names are "STATE", "...2", "COUNTY", "...4", and "TOTAL".
This dataset appears to record the number of railroad-related workers in each county in 2012. It has 2,990 rows, although not every row is a county: some rows, such as "AK Total", are state subtotals with no county name.
I chose MA and CA because MA is the state where I live and CA is, I think, a much bigger state. I used filter() to select each state's rows and summary() to get the measures of central tendency.
STATE ...2 COUNTY ...4
Length:12 Mode:logical Length:12 Mode:logical
Class :character NA's:12 Class :character NA's:12
Mode :character Mode :character
TOTAL
Min. : 44.0
1st Qu.:101.8
Median :271.0
Mean :281.6
3rd Qu.:396.8
Max. :673.0
STATE ...2 COUNTY ...4
Length:55 Mode:logical Length:55 Mode:logical
Class :character NA's:55 Class :character NA's:55
Mode :character Mode :character
TOTAL
Min. : 1.0
1st Qu.: 12.5
Median : 61.0
Mean : 238.9
3rd Qu.: 200.5
Max. :2888.0
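MA has 12 counties with an average total of 281.6, and CA has 55 counties with an average total of 238.9. So CA has more counties than MA, but MA's average total per county is larger than CA's. The medians differ even more (271.0 for MA against 61.0 for CA), because a few very large CA counties (the maximum is 2,888) pull CA's mean up.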
---
title: "Challenge 2"
author: "Young Soo Choi"
description: "Data wrangling: using group_by() and summarise()"
date: "08/16/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_2
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
library(readxl)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to
1) read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
2) provide summary statistics for different interesting groups within the data, and interpret those statistics
## Read in the Data
I chose the file 'StateCounty2012.xls' and read it in. Its first 3 rows hold a title and other non-data text, so I skipped them with the skip argument.
```{r}
# readxl is already loaded in the setup chunk.
# skip = 3 drops the three non-data rows above the header row.
state2012 <- read_xls("_data/StateCounty2012.xls", skip = 3)
state2012
colnames(state2012)
```
This dataset has 2,990 rows and 5 columns. The column names are "STATE", "...2", "COUNTY", "...4", and "TOTAL". The columns "...2" and "...4" contain only missing values.
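
As a possible next step, the empty columns and the subtotal rows could be dropped before summarising. The sketch below does not read the real file: it uses a small hand-typed sample (the names sample_raw and sample_clean are only for illustration) whose values are copied from the first rows of the printed output. I am assuming that a row with a missing COUNTY is a state subtotal, which holds for the rows shown but is not checked against the whole file.

```{r}
library(dplyr)
library(tibble)

# Hypothetical sample copied from the first printed rows (not the full file).
# .name_repair = "minimal" keeps the placeholder names exactly as readxl made them.
sample_raw <- tibble(
  STATE  = c("AE", "AE Total1", "AK", "AK", "AK Total"),
  `...2` = NA,
  COUNTY = c("APO", NA, "ANCHORAGE", "JUNEAU", NA),
  `...4` = NA,
  TOTAL  = c(2, 2, 7, 3, 103),
  .name_repair = "minimal"
)

sample_clean <- sample_raw %>%
  select(STATE, COUNTY, TOTAL) %>%  # drop the two all-NA placeholder columns
  filter(!is.na(COUNTY))            # assumption: no county name means a subtotal row

sample_clean
```

Applied to state2012, the same two steps would leave one row per county, so the row count could then be read as a county count.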
## Describe the data
This dataset appears to record the number of railroad-related workers in each county in 2012. It has 2,990 rows, although not every row is a county: some rows, such as "AK Total", are state subtotals with no county name.
```{r}
#| label: summary
summary(state2012)
```
## Provide Grouped Summary Statistics
I chose MA and CA because MA is the state where I live and CA is, I think, a much bigger state. I used filter() to select each state's rows and summary() to get the measures of central tendency.
```{r}
# Massachusetts counties
MA <- filter(state2012, STATE == "MA")
summary(MA)

# California counties
CA <- filter(state2012, STATE == "CA")
summary(CA)
```
### Explain and Interpret
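MA has 12 counties with an average total of 281.6, and CA has 55 counties with an average total of 238.9. So CA has more counties than MA, but MA's average total per county is larger than CA's. The medians differ even more (271.0 for MA against 61.0 for CA), because a few very large CA counties (the maximum is 2,888) pull CA's mean up.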
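
The same kind of comparison could also be written with group_by() and summarise(), which gives one row per state instead of a separate summary() call for each. This is only a sketch: the sample below is typed in by hand from the AE and AK county rows in the printed output, and the names sample_counties and state_summary are made up for illustration.

```{r}
library(dplyr)
library(tibble)

# Hypothetical sample: county rows copied from the printed output.
# The real analysis would start from state2012 instead.
sample_counties <- tibble(
  STATE  = c("AE", "AK", "AK", "AK", "AK", "AK", "AK"),
  COUNTY = c("APO", "ANCHORAGE", "FAIRBANKS NORTH STAR", "JUNEAU",
             "MATANUSKA-SUSITNA", "SITKA", "SKAGWAY MUNICIPALITY"),
  TOTAL  = c(2, 7, 2, 3, 2, 1, 88)
)

state_summary <- sample_counties %>%
  group_by(STATE) %>%                # one group per state
  summarise(
    n_counties   = n(),              # number of county rows in the group
    mean_total   = mean(TOTAL),
    median_total = median(TOTAL),
    max_total    = max(TOTAL)
  )

state_summary
```

With the real data, filtering to STATE %in% c("MA", "CA") before group_by() should reproduce the MA and CA figures discussed above in a single table.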